Medical Insurance Cost Estimation

TL;DR

Polynomial regression (degree 2, test R² = 0.81) outperforms the linear baseline (test R² = 0.63) on 1,338 insurance records. Smokers pay an average of $23,615 versus $8,434 for non-smokers, about 2.8× as much. BMI is the second strongest continuous predictor. The Southeast region shows the highest average charges at $14,735.

R² = 0.81 (test set)

1,338 Records

+29% vs Linear Baseline

Smoking = 2.8× higher cost

Python, Machine Learning, Polynomial Regression, Feature Engineering, Cost Estimation

Project Overview

Healthcare costs are a significant concern for individuals and insurance companies alike. This project builds a predictive model for medical insurance charges based on customer demographics and health factors, with direct applicability to both individual financial planning and insurance policy pricing.

The 1,338-record dataset includes age, BMI, smoking status, region, sex, and number of children as predictors of annual insurance charges. Linear regression was benchmarked first (R² = 0.63), then polynomial regression (degree 2) was applied after residual plots confirmed non-linearity in the BMI-charge and age-charge relationships — achieving R² = 0.81 on the test set, a 29% relative improvement in explained variance.

Model Comparison

Model	Train R²	Test R²	RMSE (Test)	Notes
Linear Regression	0.65	0.63	$6,012	Underfits — residuals show non-linear pattern
Polynomial Regression (deg 2)	0.84	0.81	$4,847	Best fit — captures BMI and age non-linearity
Polynomial Regression (deg 3)	0.89	0.76	$5,312	Overfitting — train/test gap increases

Degree 2 generalised best. Degree 3 overfit: train R² rose to 0.89 while test R² fell to 0.76, a gap of 0.13. The 0.03 train/test gap at degree 2 indicates little overfitting.
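
The comparison above can be outlined with a scikit-learn pipeline. The sketch below is a minimal illustration on synthetic data: the `make_synthetic` helper, its coefficients, and the noise level are assumptions made for this example, not the project's dataset or code, so the printed scores will not match the table.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


def make_synthetic(n=1338, seed=0):
    """Hypothetical stand-in for the insurance data (not the real records)."""
    rng = np.random.default_rng(seed)
    age = rng.uniform(18, 64, n)
    bmi = rng.normal(30, 6, n)
    smoker = rng.integers(0, 2, n)
    # Assumed curved age effect plus a smoker x BMI interaction.
    charges = 3.0 * age**2 + 250 * bmi + 600 * smoker * bmi + rng.normal(0, 2000, n)
    return np.column_stack([age, bmi, smoker]), charges


def compare_degrees(X, y, degrees=(1, 2, 3), seed=42):
    """Fit one polynomial pipeline per degree and collect train/test metrics."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    results = {}
    for d in degrees:
        model = make_pipeline(
            StandardScaler(),
            PolynomialFeatures(degree=d, include_bias=False),
            LinearRegression(),
        )
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)
        results[d] = {
            "train_r2": model.score(X_tr, y_tr),
            "test_r2": r2_score(y_te, pred),
            "test_rmse": float(np.sqrt(mean_squared_error(y_te, pred))),
        }
    return results


X, y = make_synthetic()
results = compare_degrees(X, y)
for degree, metrics in results.items():
    print(
        f"degree {degree}: train R2={metrics['train_r2']:.3f} "
        f"test R2={metrics['test_r2']:.3f} RMSE={metrics['test_rmse']:.0f}"
    )
```

Reading the train and test R² side by side for each degree is what exposes overfitting: a train score that keeps rising while the test score stalls or drops is the signal to stop adding complexity.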

Key Insights

Smoking is the dominant predictor: smokers pay an average of $23,615/year vs $8,434 for non-smokers — a 2.8× difference that dwarfs all other factors.
BMI above 30 (obese classification) is associated with a $4,200 average premium increase, and the effect is non-linear — polynomial terms captured an accelerating cost increase at higher BMI values.
Southeast region has the highest average charges at $14,735, compared with $13,406 (Northeast), $12,417 (Northwest), and $12,346 (Southwest): a gap of roughly 19% between the highest and lowest regions, worth investigating for pricing strategy.
Age-charge relationship is non-linear: charges increase more steeply after age 45, which the polynomial model captures — linear regression systematically under-predicted for older customers.
Children count adds a small but statistically significant cost increment (~$475 per child on average).
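
The headline ratios can be checked directly from the figures quoted in this write-up. The short sketch below only redoes that arithmetic; the variable names are illustrative and no new data is introduced.

```python
# Figures quoted in the text above.
smoker_avg, nonsmoker_avg = 23_615, 8_434
region_avg = {
    "Southeast": 14_735,
    "Northeast": 13_406,
    "Northwest": 12_417,
    "Southwest": 12_346,
}
linear_r2, poly_r2 = 0.63, 0.81

# How many times larger the average smoker charge is.
smoker_ratio = smoker_avg / nonsmoker_avg

# Relative gap between the most and least expensive regions.
highest = max(region_avg, key=region_avg.get)
lowest = min(region_avg, key=region_avg.get)
regional_gap = region_avg[highest] / region_avg[lowest] - 1

# Relative improvement in test R² over the linear baseline.
r2_gain = (poly_r2 - linear_r2) / linear_r2

print(f"smoker / non-smoker: {smoker_ratio:.2f}x")
print(f"{highest} vs {lowest}: {regional_gap:.1%}")
print(f"R² gain over linear baseline: {r2_gain:.1%}")
```

The smoker ratio comes out at about 2.8, the gap between the highest and lowest regions at a little over 19%, and the relative R² gain at just under 29%.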

Technical Implementation

Data Preprocessing:

No missing values in the dataset. Outliers were checked with the IQR method and kept, since they represent real high-cost patients.
One-hot encoded: region (4 categories → 3 dummies), sex (binary). Label encoded: smoker (yes/no → 1/0).
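
The preprocessing steps above might look roughly like this in pandas. This is a sketch under assumptions: the six-row frame is made up for illustration, and the column names (`age`, `sex`, `bmi`, `children`, `smoker`, `region`, `charges`) and the `iqr_outlier_mask` helper are hypothetical rather than taken from the project code.

```python
import pandas as pd

# Tiny made-up sample; the real dataset has 1,338 rows.
df = pd.DataFrame({
    "age": [19, 45, 62, 33, 27, 51],
    "sex": ["female", "male", "female", "male", "male", "female"],
    "bmi": [27.9, 31.2, 26.3, 35.1, 22.7, 29.8],
    "children": [0, 2, 0, 1, 0, 3],
    "smoker": ["no", "no", "no", "yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest",
               "northeast", "southeast", "southwest"],
    "charges": [1700.0, 8200.0, 13000.0, 38000.0, 3500.0, 9800.0],
})


def iqr_outlier_mask(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


n_missing = int(df.isna().sum().sum())
outliers = iqr_outlier_mask(df["charges"])  # flagged for inspection, not dropped

encoded = df.copy()
# Binary smoker flag: yes/no -> 1/0.
encoded["smoker"] = encoded["smoker"].map({"no": 0, "yes": 1})
# drop_first=True turns 4 regions into 3 dummies and sex into 1 dummy.
encoded = pd.get_dummies(encoded, columns=["region", "sex"],
                         drop_first=True, dtype=int)

print(n_missing, int(outliers.sum()))
print(list(encoded.columns))
```

Dropping the first dummy per category avoids perfectly collinear columns in an ordinary least-squares fit, which is the usual reason for the 4 → 3 reduction.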

Feature Engineering:

Applied PolynomialFeatures(degree=2, interaction_only=False) from sklearn — generates all polynomial and interaction terms from the original 6 features.
The smoker × BMI interaction term was statistically significant: obesity amplifies the smoking premium.

Model Evaluation:

80/20 train/test split with random state fixed for reproducibility.
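5-fold cross-validation gave R² = 0.81 ± 0.03, stable across folds and consistent with the held-out test score.
RMSE of $4,847 on the test set: a typical prediction error of roughly $5k. Because RMSE weights large errors more heavily, the mean absolute error is no larger than this.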
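
The evaluation protocol above can be sketched as follows. The data is synthetic and the target formula is an assumption made for the example; only the structure (fixed-seed 80/20 split, 5-fold cross-validation, RMSE on the held-out set) mirrors the text. Running the cross-validation on the training portion only is one reasonable choice, not necessarily the one used in the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data with a quadratic term and an interaction.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 3))
y = 2 * X[:, 0] ** 2 + X[:, 1] * X[:, 2] + X[:, 0] + rng.normal(0, 0.2, 500)

model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)

# 80/20 split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5-fold cross-validated R² on the training portion.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="r2")

# Final fit and held-out error metrics.
model.fit(X_train, y_train)
pred = model.predict(X_test)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
mae = float(mean_absolute_error(y_test, pred))

print(f"CV R²: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
print(f"test RMSE: {rmse:.3f}, test MAE: {mae:.3f}")
```

Reporting MAE next to RMSE is a cheap way to see how much of the error comes from a few large misses: the further RMSE sits above MAE, the heavier the tail.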

Key Learnings

Residual plots are essential before choosing model complexity: the non-random pattern in the linear regression residuals against BMI (a funnel shape, which also suggests non-constant variance) directly indicated the need for polynomial terms. This is how polynomial regression should always be motivated, not by trial and error.
Overfitting risk is real even with 6 features: a degree-3 expansion turns the 6 inputs into 83 polynomial terms on a 1,338-row dataset, which widened the train/test gap. Regularisation (Ridge/Lasso) would be the next step before revisiting degree 3.
Interaction terms reveal business insights — the smoker × BMI interaction showing that obesity compounds the smoking cost increase is not just a statistical artefact; it's actionable for underwriting teams.
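
The residual check described in the first point can also be done numerically, without a plot. The sketch below uses made-up BMI values and an assumed curved relationship (both hypothetical), fits a straight line and a quadratic by least squares, and compares the mean residual in five BMI bins.

```python
import numpy as np

rng = np.random.default_rng(1)
bmi = rng.uniform(16, 50, 1000)  # hypothetical BMI values
# Assumed curved relationship, for illustration only.
charges = 40 * (bmi - 30) ** 2 + 300 * bmi + rng.normal(0, 1500, 1000)


def residuals(design, target):
    """Least-squares residuals for a given design matrix."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef


def binned_means(x, r, n_bins=5):
    """Mean residual within equal-count bins of x."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.digitize(x, edges[1:-1])  # bin index 0 .. n_bins-1
    return np.array([r[idx == b].mean() for b in range(n_bins)])


ones = np.ones_like(bmi)
res_linear = residuals(np.column_stack([ones, bmi]), charges)
res_quad = residuals(np.column_stack([ones, bmi, bmi**2]), charges)

lin_bins = binned_means(bmi, res_linear)
quad_bins = binned_means(bmi, res_quad)
print(np.round(lin_bins))
print(np.round(quad_bins))
```

With the straight-line fit the binned means are positive at both ends and negative in the middle, a U shape that points to a missing squared term; after adding it, the binned means shrink towards zero.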

Future Work

Apply Ridge regression to degree-3 polynomial features — regularisation might allow higher-order terms without overfitting, potentially improving the test R².
Evaluate gradient boosting models (XGBoost) as a non-parametric baseline — they may naturally capture the non-linearities without requiring explicit polynomial feature creation.
Add confidence intervals on predictions — for insurance pricing, a point estimate without uncertainty quantification is insufficient for actuarial use.
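
The first item above could be prototyped along these lines. This is only a sketch on synthetic data: the target formula, the alpha grid, and the choice of `RidgeCV` to pick the penalty are assumptions for illustration, and nothing here has been run on the insurance records.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in with 6 inputs, matching the number of predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = X[:, 0] ** 2 + X[:, 1] * X[:, 2] + rng.normal(0, 0.5, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]  # assumed search grid
ridge_deg3 = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    StandardScaler(),  # put all polynomial terms on one scale before penalising
    RidgeCV(alphas=alphas),
)
ridge_deg3.fit(X_train, y_train)

n_terms = ridge_deg3.named_steps["polynomialfeatures"].n_output_features_
chosen_alpha = ridge_deg3.named_steps["ridgecv"].alpha_
test_r2 = ridge_deg3.score(X_test, y_test)

print(n_terms, chosen_alpha, round(test_r2, 3))
```

Scaling after the polynomial expansion matters here because the Ridge penalty treats all coefficients alike, and raw cubic terms would otherwise sit on a very different scale from the linear ones.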

Built by Om Patel — ML Engineer & Data Scientist.